<html>
<head>
<link href="css/assignments.css" rel="stylesheet" type="text/css">
<title>Trends</title>
</head>

<body>
<h2> Project 2: Twitter Trends </h2>

<blockquote style="text-align:center;">
  <div class="c1"><img src="texas.png"></div>

  <p>What do people tweet?<br>
    Draw their feelings on a map <br>
    to discover <i>trends</i>.</p>
</blockquote>

<h3>Introduction</h3>

<p>In this project, you will develop a geographic visualization of Twitter data
across the USA. You will need to use dictionaries, lists, and data abstraction
techniques to create a modular program. This project uses ideas from Sections
2.1-2.4 of the lecture notes.</p>

<p>The map displayed above depicts how the people in different states feel
about Texas.  This image is generated by:

<ol>
  <li> Collecting public Twitter posts (tweets) that have been tagged with
  geographic locations and filtering for those that contain the "texas" query
  term,
  <li> Assigning a sentiment (positive or negative) to each tweet, based on all
  of the words it contains,
  <li> Aggregating tweets by the state with the closest geographic center, and
  finally
  <li> Coloring each state according to the aggregate sentiment of its tweets.
  Red means positive sentiment; blue means negative.
</ol>

<p>The details of how to conduct each of these steps are contained within the
project description.  By the end of this project, you will be able to map the
sentiment of any word or phrase. There are two alternative zip archives related
to this project:

<ul>
  <li> The <a href="trends.zip">full project</a>, which contains all the starter
  code and all data (<b>warning: 81 MB</b>).

  <li> A <a href="trends_small.zip">small version</a> that contains all the
  starter code, but only a small subset of the data.  You can complete the
  project in its entirety using this archive, and you have the option to add
  more data at the end.
</ul>

<p>The project uses several files, but all of your changes will be made to the
first one.</p>

<table cellpadding="10" cellspacing="2">
  <tr>
    <td>
      <p><code><a href="trends.py.html">trends.py</a></code></p>
    </td>

    <td>
      <p>A starter implementation of the main project file.</p>
    </td>
  </tr>

  <tr>
    <td>
      <p><code><a href="geo.py.html">geo.py</a></code></p>
    </td>

    <td>
      <p>Geographic positions, 2-D projection equations, and geographic
      distance functions.</p>
    </td>
  </tr>

  <tr>
    <td>
      <p><code><a href="maps.py.html">maps.py</a></code></p>
    </td>

    <td>
      <p>Functions for drawing maps.</p>
    </td>
  </tr>


  <tr>
    <td>
      <p><code><a href="data.py.html">data.py</a></code></p>
    </td>

    <td>
      <p>Functions for loading Twitter data from files.</p>
    </td>
  </tr>

  <tr>
    <td>
      <p><code><a href="graphics.py.html">graphics.py</a></code></p>
    </td>

    <td>
      <p>A simple Python graphics library.</p>
    </td>
  </tr>

  <tr>
    <td>
      <p><code><a href="ucb.py.html">ucb.py</a></code></p>
    </td>

    <td>
      <p>Utility functions for 61A.</p>
    </td>
  </tr>
</table>

<p>The <a href="data">data</a> directory contains all the data files needed for
the project. The above zip archives contain this directory, but if you download
the individual project files above, you will need to download the data directory
as well.</p>

<h3>Logistics</h3>

<p>This is a two-week project. You'll work in a team of two people, and you can
complete all problems together with your partner.</p>

<p>Start early! Feel free to ask for help early and often. The course staff is
here to assist you, but we can't help everyone an hour before the deadline. <a
  href="http://www.piazza.com">Piazza</a> awaits. You are not alone! </p>

<p>In the end, you and your partner will submit one project. There are 19 possible
points (16 for correctness and 3 for composition). You only need to submit the
file <code><a href="trends.py.html">trends.py</a></code>. You do not need to modify any other files for this
project. To submit the project, change to the directory where the
<code><a href="trends.py.html">trends.py</a></code> file is located and run <code>submit proj2</code>. </p>


<h3>Phase 1: The Feelings in Tweets</h3>

<p>In this phase, you will create an abstract data type for tweets, split the
text of a tweet into words, and calculate the amount of positive or negative
feeling in a tweet.</p>

<h4>Tweets</h4>

<p>First, you will implement an abstract data type for Tweets. The constructor
<code>make_tweet</code> is defined at the top of <code><a href="trends.py.html">trends.py</a></code>.
<code>make_tweet</code> returns a Python dictionary with
the following entries:</p>

<pre>
  text:      a string, the text of the tweet, all in lowercase
  time:      a datetime object, when the tweet was posted
  latitude:  a floating-point number, the latitude of the tweet's location
  longitude: a floating-point number, the longitude of the tweet's location
</pre>
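<p>As a minimal sketch of this representation, a tweet dictionary could be
built as shown below. The argument order of <code>make_tweet</code> is an
assumption here, and the example tweet is made up; check the starter code for
the actual definition.</p>

<pre>
from datetime import datetime

def make_tweet(text, time, lat, lon):
    """Return a tweet as a dictionary with the four entries listed above."""
    return {'text': text, 'time': time, 'latitude': lat, 'longitude': lon}

# A made-up tweet, used only for illustration.
example = make_tweet('just ate a sandwich', datetime(2011, 8, 29, 22, 5), 38.0, -122.0)
print(example['text'])       # the lowercase text of the tweet
print(example['time'].hour)  # the hour of the day at which it was posted
</pre>

<p>Code outside the tweet abstraction should read these entries through the
selectors rather than by key.</p>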

<p><b>Problem 1</b> (1 pt). Implement the <code>tweet_words</code> and
<code>tweet_time</code> selectors.  Call <code>extract_words</code> to list the
words in the text of a tweet.

<p><b>Problem 2</b> (1 pt). Implement the <code>tweet_location</code> selector,
which  returns a <code>position</code>. Positions are another abstract data
type, defined at the top of <code><a href="geo.py.html">geo.py</a></code>.  Make sure that you understand
how to manipulate positions; they play an important role in this project.

<p>When you complete problems 1 and 2, the doctest for <code>make_tweet</code>
should pass.

<pre>
python3 trends.py -t make_tweet
</pre>

<p><b>Problem 3</b> (1 pt). Improve the <code>extract_words</code>
function as follows:  Assume that a word is any consecutive substring of
<code>text</code> that consists only of ASCII letters.  The string
<code>ascii_letters</code> in the <code>string</code> module contains all
letters in the ASCII character set. The <code>extract_words</code> function
should list all such words in order and nothing else.

<p>When you complete this problem, the doctest for <code>extract_words</code>
should pass.

<pre>
python3 trends.py -t extract_words
</pre>
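<p>One possible approach is to scan the text one character at a time and start
a new word whenever a non-letter appears. The sketch below uses the
hypothetical name <code>split_into_words</code> so that it is not mistaken for
the starter code; it is one way to apply the rule above, not the required
implementation.</p>

<pre>
from string import ascii_letters

def split_into_words(text):
    """Return the maximal runs of ASCII letters in text, in order."""
    words, current = [], ''
    for ch in text:
        if ch in ascii_letters:
            current += ch          # extend the word being built
        elif current:
            words.append(current)  # a non-letter ends the current word
            current = ''
    if current:
        words.append(current)      # keep a word that runs to the end
    return words

print(split_into_words("anything else.....not my job"))
</pre>

<p>Note that an apostrophe counts as a separator under this rule, so
<code>they're</code> yields the two words <code>they</code> and
<code>re</code>.</p>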

<p><b>Problem 4</b> (1 pt). Implement the sentiment abstract data type, which
represents a sentiment value that may or may not exist. The constructor
<code>make_sentiment</code> takes either a numeric value within the interval -1
to 1, or <code>None</code> to indicate that the value does not exist. Implement
the selectors <code>has_sentiment</code> and <code>sentiment_value</code> as
well.  <i>You may use any representation you choose, but the rest of your
    program should not depend on this representation.</i>

<p>When you complete this problem, the doctests for <code>make_sentiment</code>
and <code>get_word_sentiment</code> should pass. You can also call the
<code>print_sentiment</code> function to print the sentiment values of all
sentiment-carrying words in a line of text.

<pre>
python3 trends.py -t make_sentiment
python3 trends.py -t get_word_sentiment
python3 trends.py -p computer science is my favorite!
python3 trends.py -p life without lambda: awful or awesome?
</pre>
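<p>Any representation is acceptable. As an illustration only, the sketch below
wraps the value in a one-element tuple, which keeps a sentiment of zero
distinct from a sentiment that does not exist. The range check is an addition
of this sketch, not something the starter code is known to require.</p>

<pre>
def make_sentiment(value):
    """Wrap value (a number from -1 to 1, or None) in a one-element tuple."""
    # Clamping a value to [-1, 1] leaves it unchanged only if it is in range.
    assert value is None or max(-1, min(1, value)) == value, 'out of range'
    return (value,)

def has_sentiment(s):
    """Return whether sentiment s holds a value."""
    return s[0] is not None

def sentiment_value(s):
    """Return the value of a sentiment s that has one."""
    assert has_sentiment(s), 'No sentiment value'
    return s[0]

neutral = make_sentiment(0)
unknown = make_sentiment(None)
print(has_sentiment(neutral), has_sentiment(unknown))
</pre>

<p>Because the rest of the program only calls the constructor and the two
selectors, the tuple could later be swapped for a dictionary or a function
without changing any other code.</p>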

<p><b>Problem 5</b> (1 pt). Implement <code>analyze_tweet_sentiment</code>,
which takes a tweet (of the abstract data type) and returns a
<code>sentiment</code>.  Read the docstrings for
<code>get_word_sentiment</code> and <code>analyze_tweet_sentiment</code> to
understand how the two functions interact. <i>Your implementation should not
    depend on the representation of a sentiment!</i>

<p>When you complete this problem, the doctests for
<code>analyze_tweet_sentiment</code> should pass.

<pre>
python3 trends.py -t analyze_tweet_sentiment
</pre>

<p>You can also check your code for data abstraction violations by
running the following command:</p>

<pre>
python3 trends.py -a analyze_tweet_sentiment
</pre>
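<p>The sketch below illustrates the general idea on plain values. It assumes
that a tweet's sentiment is the average over the words that carry sentiment,
and that a tweet with no such words has no sentiment at all; the docstrings in
<code>trends.py</code> are the authority on the actual behavior. The
<code>WORD_SCORES</code> table and the name <code>average_word_score</code>
are hypothetical.</p>

<pre>
# Hypothetical word scores, invented for this example only.
WORD_SCORES = {'awesome': 1.0, 'awful': -1.0, 'favorite': 0.5}

def average_word_score(words):
    """Average the scores of the words that have one, or return None."""
    scores = [WORD_SCORES[w] for w in words if w in WORD_SCORES]
    if not scores:
        return None  # no word carries sentiment: unknown, not neutral
    return sum(scores) / len(scores)

print(average_word_score(['life', 'without', 'lambda', 'awful', 'or', 'awesome']))
print(average_word_score(['hello', 'world']))
</pre>

<p>In the project itself, the per-word values would come from
<code>get_word_sentiment</code> and be read only through the sentiment
selectors, never by inspecting the representation directly.</p>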

<h3>Phase 2: The Geometry of Maps</h3>

<h4>Positions</h4>

<p>We will use the position abstract data type to represent geographic
latitude-longitude positions on the Earth. The data abstraction, defined at the
top of <code><a href="geo.py.html">geo.py</a></code>, has the constructor <code>make_position</code> and
the selectors <code>latitude</code> and <code>longitude</code>.

<p>In this phase, you will write two functions that together determine the
centers of U.S. states. The shape of a state is represented as a list of
polygons.  Some states (e.g. Hawaii) consist of multiple polygons, but most
states (e.g.  Colorado) consist of only one polygon (still represented as a
length-one list).</p>

<p><b>Problem 6</b> (2 pt). Implement <code>find_centroid</code>, which takes a
polygon and returns three values: the coordinates of its centroid and its area.
The input polygon is represented as a list of <code>position</code> abstract
data types, which are the consecutive vertices of its perimeter.  The first
vertex is always identical to the last.

<p>The centroid of a two-dimensional shape is its center of balance: the
average position of all the points in the shape.  <code>find_centroid</code>
returns the centroid and area of an individual polygon.

<p>The formula for computing the <a
  href="http://en.wikipedia.org/wiki/Centroid#Centroid_of_polygon">centroid of a
  polygon</a> appears on Wikipedia. The formula relies on vertices being
consecutive (either clockwise or counterclockwise; both give the same answer),
a property that you may assume always holds for the input.

<p>The area of a polygon is never negative.  Depending on how you compute the
area, you may need to use the built-in <code>abs</code> function to return a
non-negative number.</p>
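<p>As a sketch of the polygon formula, the function below works on plain
<code>(x, y)</code> tuples instead of the <code>position</code> abstraction,
under the hypothetical name <code>polygon_centroid</code>. Returning the first
vertex for a zero-area polygon is an assumption made here to avoid dividing by
zero.</p>

<pre>
def polygon_centroid(points):
    """Return (cx, cy, area) for a closed polygon.

    points is a list of (x, y) tuples whose first and last entries match.
    """
    signed_area = 0
    cx = cy = 0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cross = x0 * y1 - x1 * y0      # shoelace term for one edge
        signed_area += cross / 2
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if signed_area == 0:
        # Degenerate polygon (assumption): fall back to the first vertex.
        return points[0][0], points[0][1], 0
    # The sign of signed_area cancels in the centroid; only the area needs abs.
    return cx / (6 * signed_area), cy / (6 * signed_area), abs(signed_area)

triangle = [(1, 2), (3, 4), (5, 0), (1, 2)]
print(polygon_centroid(triangle))  # (3.0, 2.0, 6.0)
</pre>

<p>Listing the same vertices in the opposite direction flips the sign of
<code>signed_area</code> but yields the same centroid and the same area.</p>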

<p>When you complete this problem, the doctest for <code>find_centroid</code>
should pass.

<pre>
python3 trends.py -t find_centroid
</pre>

<p>You can also check your code for data abstraction violations with
the following command:</p>

<pre>
python3 trends.py -a find_centroid
</pre>

<p><b>Problem 7</b> (2 pt). Implement <code>find_state_center</code>, which takes a
state represented by a list of polygons and returns a position, its centroid.

<p>The centroid of a collection of polygons can be computed by <a
    href="http://en.wikipedia.org/wiki/Centroid#By_geometric_decomposition">geometric
    decomposition</a>.  The centroid of a shape is the weighted average of the
centroids of its component polygons, weighted by their area.
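<p>The weighted average can be sketched as follows, again on plain numbers
rather than positions. The name <code>combined_centroid</code> and the
<code>(cx, cy, area)</code> tuple layout are assumptions of this example, and
the total area is assumed to be nonzero.</p>

<pre>
def combined_centroid(parts):
    """Return the area-weighted average (cx, cy) of several centroids.

    parts is a list of (cx, cy, area) tuples, one per polygon.
    """
    total = sum(area for _, _, area in parts)
    cx = sum(x * area for x, _, area in parts) / total
    cy = sum(y * area for _, y, area in parts) / total
    return cx, cy

# Two made-up polygons: the larger one pulls the result toward itself.
print(combined_centroid([(0, 0, 1), (4, 0, 3)]))  # (3.0, 0.0)
</pre>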

<p>When you complete this problem, the doctest for <code>find_state_center</code>
should pass.

<pre>
python3 trends.py -t find_state_center
</pre>

<p>You can also check your code for data abstraction violations with
the following command:</p>

<pre>
python3 trends.py -a find_state_center
</pre>

<p>Once you are finished, <code>draw_centered_map</code> will draw the
<code>10</code> states closest to a given state (including that state).

<pre>
python3 trends.py -d CA
</pre>


<h3>Phase 3: The Mood of the Nation</h3>

<h4>States</h4>

<p>The name <code>us_states</code> is bound to a dictionary containing the
shape of each U.S. state, keyed by its two-letter postal code. You can use
the keys of this dictionary to iterate over all the U.S. states.

<p>In this phase, you will write functions to determine the state that a tweet
is coming from, group tweets by state, and calculate the average positive or
negative feeling in all the tweets associated with a state.</p>

<p><b>Problem 8</b> (1 pt). Implement <code>find_closest_state</code>, which
returns the two-letter postal code of the state that is closest to the location
of a tweet. Use the <code>geo_distance</code> function (provided in
<code><a href="geo.py.html">geo.py</a></code>) to calculate the shortest distance in miles between two
positions.

<p>When you complete this problem, the doctests for
<code>find_closest_state</code> should pass.

<pre>
python3 trends.py -t find_closest_state
</pre>
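<p>Finding the nearest center is a minimization over the dictionary's keys.
In the sketch below, <code>flat_distance</code> is a simple stand-in for
<code>geo_distance</code>, and the centers are made-up numbers rather than
real state centroids; the names <code>closest_key</code> and
<code>centers</code> are hypothetical.</p>

<pre>
from math import hypot

# Made-up (latitude, longitude) centers, for illustration only.
centers = {'CA': (37.0, -120.0), 'NJ': (40.0, -74.5), 'TX': (31.0, -99.0)}

def flat_distance(a, b):
    """Straight-line distance between two (lat, lon) pairs (a stand-in)."""
    return hypot(a[0] - b[0], a[1] - b[1])

def closest_key(location, centers):
    """Return the key whose center is nearest to location."""
    return min(centers, key=lambda code: flat_distance(location, centers[code]))

print(closest_key((38.0, -122.0), centers))
</pre>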

<p><b>Problem 9</b> (1 pt). Implement <code>group_tweets_by_state</code>, which
takes a sequence of tweets and returns a dictionary.  The keys of the returned
dictionary are state names (two-letter postal codes), and the values are lists
of tweets that appear closer to that state's center than any other.

<p>When you complete this problem, the doctests for
<code>group_tweets_by_state</code> should pass.

<pre>
python3 trends.py -t group_tweets_by_state
</pre>
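<p>Grouping items into a dictionary of lists is a common pattern. The sketch
below shows it on plain strings with a hypothetical helper
<code>group_by</code>; in the project, the items would be tweets and the key
would be the closest state.</p>

<pre>
def group_by(items, key_fn):
    """Return a dictionary mapping each key to the list of items with it."""
    groups = {}
    for item in items:
        # setdefault creates the empty list the first time a key is seen.
        groups.setdefault(key_fn(item), []).append(item)
    return groups

print(group_by(['apple', 'avocado', 'banana'], lambda word: word[0]))
</pre>

<p>Only keys that actually occur appear in the result, so a state with no
tweets would simply be absent.</p>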

<p><b>Problem 10</b> (1 pt). As an exercise, implement
<code>most_talkative_states</code>, which returns a list of the five states
with the largest number of tweets containing <code>term</code>, in descending
order (from most to least).  Each element of the list should be a two-element
tuple, with the first entry being the state and the second entry being the
number of tweets, e.g. <code>('CA', 23)</code>.

<p><i>Hint:</i> The following methods may prove useful.
<ul>
  <li><i>dict</i>.pop('<i>key</i>') removes <i>key</i> and its associated <i>value</i> from the
  dictionary <i>dict</i> and returns <i>value</i>.</li>
  <li>del <i>dict</i>['<i>key</i>'] removes the <i>key</i>-<i>value</i> pair from the
  dictionary <i>dict</i>.</li>
</ul>

<p>When you complete this problem, the doctests for
<code>most_talkative_states</code> should pass.

<pre>
python3 trends.py -t most_talkative_states
</pre>
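<p>As an alternative to repeatedly removing the largest entry, the counts can
be sorted once. The sketch below uses the built-in <code>sorted</code>
function on a made-up dictionary of tweet counts; the name
<code>top_five</code> is hypothetical.</p>

<pre>
def top_five(counts):
    """Return the five (key, count) pairs with the largest counts."""
    pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:5]

# Made-up tweet counts, for illustration only.
counts = {'CA': 23, 'TX': 17, 'NY': 19, 'WA': 4, 'OR': 2, 'NV': 1}
print(top_five(counts))
</pre>

<p>If fewer than five entries exist, the slice simply returns all of them.</p>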

<p><b>Problem 11</b> (2 pt). Implement <code>average_sentiments</code>.  This
function takes the dictionary returned by <code>group_tweets_by_state</code>
and also returns a dictionary.  The keys of the returned dictionary are the
state names (two-letter postal codes), and the values are average sentiment
values for all the tweets that <b>have a sentiment value</b> in that state.

<p>If a state has no tweets with sentiment values, leave it out of the returned
dictionary entirely.  Do not include a state with no sentiment using a zero
sentiment value.  Zero represents neutral sentiment, not unknown sentiment.
States with unknown sentiment will appear gray, while states with neutral
sentiment will appear white.
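<p>The distinction between unknown and neutral can be sketched on plain
values, with <code>None</code> standing in for a tweet that has no sentiment.
The name <code>average_known</code> and the sample data are hypothetical; the
project code would go through the sentiment selectors instead.</p>

<pre>
def average_known(values_by_state):
    """Average the non-None values per state, omitting states with none."""
    averages = {}
    for state, values in values_by_state.items():
        known = [v for v in values if v is not None]
        if known:  # skip states whose tweets all lack a sentiment
            averages[state] = sum(known) / len(known)
    return averages

# Made-up sentiment values, for illustration only.
sample = {'CA': [0.5, None, -0.5], 'TX': [None], 'NY': [1.0]}
print(average_known(sample))
</pre>

<p>Here <code>CA</code> averages to zero and stays in the result (neutral),
while <code>TX</code> is left out (unknown).</p>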

<p>You should now be able to draw maps that are colored by sentiment
corresponding to tweets that contain a given term.

<pre>
python3 trends.py -m sandwich
python3 trends.py -m obama
python3 trends.py -m texas
python3 trends.py -m my life
</pre>

<p>If you downloaded the small version of the project, you will only be able to map
these four terms.  If you would like to map any other term, you will need to download
this <a href="data/all_tweets.txt">Twitter data file</a> and place it in the
<code>data</code> directory of your project.</p>


<h3>Phase 4: Into the Fourth Dimension</h3>

<p>In this last section, you will take into account the fourth dimension:
time. Each tweet has a <code>datetime</code> object, which represents the time
that the tweet was posted.

<p>The provided <code>draw_map_by_hour</code> function visualizes the tweets
that were posted during each hour of the day. For example, "sandwich" tweets
appear most positive at 10:00pm: late night snack!

<p><b>Problem 12</b> (2 pt). Implement <code>group_tweets_by_hour</code>, which
takes a sequence of tweets and returns a dictionary. The keys of the returned
dictionary are the integers 0 through 23, representing the 24 hours of the day.
The value for each key is a list of the tweets that were posted during that
hour.

<p>To get started, read the online <a
    href="http://docs.python.org/py3k/library/datetime.html#datetime.datetime">documentation</a>
on <code>datetime</code> objects.  When you are finished, you can visualize the
changing sandwich sentiments over time.

<pre>
python3 trends.py -b sandwich
</pre>
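<p>A <code>datetime</code> object exposes its hour of the day through the
<code>hour</code> attribute, an integer from 0 to 23. The sketch below groups
bare <code>datetime</code> values rather than tweets, under the hypothetical
name <code>group_by_hour</code>, and creates all 24 keys up front so that
hours with no entries map to empty lists.</p>

<pre>
from datetime import datetime

def group_by_hour(times):
    """Return a dictionary from each hour 0 through 23 to its datetimes."""
    by_hour = {hour: [] for hour in range(24)}
    for t in times:
        by_hour[t.hour].append(t)
    return by_hour

# Made-up posting times, for illustration only.
times = [datetime(2011, 8, 29, 22, 5),
         datetime(2011, 8, 30, 22, 40),
         datetime(2011, 8, 30, 9, 0)]
by_hour = group_by_hour(times)
print(len(by_hour[22]), len(by_hour[9]), len(by_hour[0]))
</pre>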

<p><i>Congratulations!</i> One more 61A project completed.


<h3>Extensions</h3>

<p>These extensions are optional and ungraded.  In this class, you are welcome
to program just for fun.  If you build something interesting, come to office
hours and give us a demo. However, please do not change the behavior or
signature of the functions you have already implemented.

<ul>
  <li>Punctuation can be an indicator of sentiment as well.  Add an emoticon
  (smiley) detector that attributes positive sentiment to happy faces
  <code>:-)</code> and negative sentiment to sad ones.

  <li>In the standard implementation, some tweets are associated with different
  states than the ones in which they occurred.  For example, all tweets from
  Manhattan are assigned to New Jersey.  New Yorkers would be appalled! Write a
  function <code>find_containing_state</code> that finds the state that
  actually contains a tweet position.

  <li>The <code><a href="graphics.py.html">graphics.py</a></code> package supports animation.  Use the
  <code>slide_shape</code> method to have states and dots slide into place.

  <li><a href="http://norvig.com/spell-correct.html">Correct the spelling</a> of
  tweets before you compute their sentiment.

  <li>Calculate the total average sentiment of the whole country for a term and
  display it using the <code>maps.py</code> and <code>graphics.py</code>
  modules. Try to understand the implementation of
  <code>draw_most_talkative_states</code>, then use it as a foundation and
  modify it as needed.
</ul>

<p><b>Acknowledgements:</b> Aditi Muralidharan developed this project with John
DeNero. Hamilton Nguyen extended it.

</body>
</html>
